2017 Edition
We can categorize machine learning algorithms into two main groups: supervised learning and unsupervised learning. Supervised learning algorithms predict unknown target values for new data by learning from many previously seen examples whose target values are known. In contrast, unsupervised learning algorithms explore data that has no target attribute, looking for intrinsic structure in it.
Clustering is a technique for finding similar groups in data, called clusters. Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given.
In this notebook, we will use K-means, a very well-known clustering algorithm, to detect anomalous network connections based on statistics about each of them. A thorough overview of K-means clustering, from a research perspective, can be found in the following tutorial.
We expect students to:
Clustering is a typical and well-known type of unsupervised learning. Clustering algorithms try to find natural groupings in data. Similar data points (according to some notion of similarity) are considered in the same group. We call these groups clusters.
K-means clustering is a simple and widely used clustering algorithm. Given a value of $k$, it tries to build $k$ clusters from the samples in the dataset; $k$ is therefore a hyperparameter of the model. The right value of $k$ is not easy to determine, as it depends heavily on the dataset and on the way the data is featurized.
To measure the similarity between any two data points, K-means requires a distance function to be defined between them. What is a distance? It is a value that indicates how close two data points are in their space. When data points lie in a $d$-dimensional space, the Euclidean distance is a good default choice of distance function, and it is the one supported by MLlib.
In K-means, a cluster is a group of points, with a representative entity called a centroid. A centroid is also a point in the data space: the center of all the points that make up the cluster, defined as their arithmetic mean. In general, when working with K-means, each data sample is represented as a $d$-dimensional numeric vector, for which it is easy to define an appropriate distance function. As a consequence, in some applications the original data must be transformed into a different representation to fit the requirements of K-means.
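The arithmetic-mean definition of a centroid is easy to check with NumPy (a tiny synthetic example):

```python
import numpy as np

# three points in a 3-dimensional space
points = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0],
                   [2.0, 6.0, 1.0]])

# the centroid is the arithmetic mean along each dimension
centroid = points.mean(axis=0)
print(centroid)  # [2. 4. 3.]
```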
Given $k$, the K-means algorithm works as follows:
1. Choose $k$ initial centroids (for example, at random among the data points).
2. Assign each data point to its closest centroid, forming $k$ clusters.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the centroids no longer move (or move less than some threshold).
We can also terminate the algorithm when it reaches an iteration budget, which yields an approximate result. From the pseudo-code of the algorithm, we can see that K-means results are sensitive to the choice of initial centroids (and hence, with random initialization, to the order in which data samples are picked). A sensible practice is to run the analysis several times with randomized initializations; then average the cluster centers of those runs and use the averages as initial centers for one final run of the analysis.
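Rather than averaging centers across runs, another common practice (the one scikit-learn's `KMeans` implements through its `n_init` parameter) is to run the algorithm several times from different random initializations and keep the run with the lowest within-cluster sum of squares. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

# two well-separated blobs of points (synthetic data, for illustration)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

# n_init=10: run K-means from 10 random initializations and keep the
# run with the lowest within-cluster sum of squares (inertia)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one center near (0, 0), one near (10, 10)
```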
One of the best ways to study an algorithm is to try to implement it. In this section, we will implement a simple K-means algorithm step by step.

import numpy as np

# calculate the Euclidean distance between two d-dimensional points
def euclidean_distance(p1, p2):
    return np.linalg.norm(np.array(p1) - np.array(p2))

# test our function
assert (round(euclidean_distance([1, 2, 3], [10, 18, 12]), 2) == 20.45), "Function's wrong"
def find_closest_centroid(datapoint, centroids):
    # return the index of the centroid closest to the given data point
    dist_to_centroids = np.array([euclidean_distance(c, datapoint) for c in centroids])
    return int(np.argmin(dist_to_centroids))

assert (find_closest_centroid([1, 1, 1], [[2, 1, 2], [1, 2, 1], [3, 1, 2]]) == 1), "Function's wrong"
np.random.seed(22324)

# pick k random data points as initial centroids (without re-seeding or mutating the input)
def randomize_centroids(data, k):
    indices = np.random.choice(len(data), size=k, replace=False)
    return data[indices]

assert (len(
    randomize_centroids(
        np.array([
            [2, 1, 2],
            [1, 2, 1],
            [3, 1, 2]
        ]),
        2)) == 2), "Wrong function"
MAX_ITERATIONS = 100

# return True if the clusters have converged, otherwise return False
def check_converge(centroids, old_centroids, num_iterations, threshold=0):
    # stop if we reach the iteration budget
    if num_iterations >= MAX_ITERATIONS:
        return True
    # check whether the centroids moved at most `threshold` in total
    delta_dist = np.array([euclidean_distance(centroids[i], old_centroids[i])
                           for i in range(len(centroids))])
    return delta_dist.sum() <= threshold
# centroids: the current list of centers
# cluster: a list of k elements; the i-th element is the list of data points assigned to center i
def update_centroids(centroids, cluster):
    # the new centroid of each cluster is the mean of its points;
    # if a cluster is empty, keep its previous centroid
    return np.array([np.mean(points, axis=0) if len(points) > 0 else centroids[i]
                     for i, points in enumerate(cluster)])
# data : set of data points
# k : number of clusters
# centroids: initial list of centroids
def kmeans(data, k=2, centroids=None):
    # randomize the centroids if they are not given
    if centroids is None:
        centroids = randomize_centroids(data, k)
    iterations = 0
    while True:
        iterations += 1
        # init empty clusters
        clusters = [[] for i in range(k)]
        # assign each data point to the closest centroid
        for datapoint in data:
            centroid_idx = find_closest_centroid(datapoint, centroids)
            clusters[centroid_idx].append(np.array(datapoint))
        # keep the current position of the centroids before changing them
        old_centroids = centroids[:]
        # update centroids
        centroids = update_centroids(centroids, clusters)
        # if the stop criteria are met, stop the algorithm
        if check_converge(centroids, old_centroids, iterations, threshold=0):
            break
    return centroids
Next, we will test our algorithm on Fisher's Iris dataset, and plot the resulting clusters in 3D.
# the source code in this cell is inspired by
# https://gist.github.com/bbarrilleaux/9841297
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets, cluster
from sklearn.decomposition import PCA

# load data
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

# do the clustering
centers = kmeans(X_iris, k=3)
labels = [find_closest_centroid(p, centers) for p in X_iris]
# here we use the scikit-learn implementation of K-means
centers2 = cluster.KMeans(n_clusters=3)
centers2.fit(X_iris)
labels2 = centers2.labels_
print('Cluster labels from the self-built K-means: ', labels)
print('Cluster labels from the scikit-learn K-means: ', labels2)
#plot the clusters in color
fig = plt.figure(1, figsize=(8, 8))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
plt.cla()
ax.scatter(X_iris[:, 3], X_iris[:, 0], X_iris[:, 2], c=labels)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.title('Clustering on Iris Dataset')
plt.show()
pca = PCA(n_components=3)
score = pca.fit_transform(X_iris)
#plot the clusters in color
fig = plt.figure(1, figsize=(8, 8))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
plt.cla()
ax.scatter(score[:, 2], score[:, 1], score[:, 0], c=labels)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.title('Clustering on Iris Dataset with PCA')
plt.show()
That's enough about K-means for now. In the next section, we will apply MLlib's K-means on Spark to deal with a large dataset in a real use case.
Some attacks attempt to flood a computer with network traffic. Others attempt to exploit flaws in networking software in order to gain unauthorized access to a computer. Detecting an exploit in an incredibly large haystack of network requests is not easy.
Some exploit behaviors follow known patterns, such as scanning every port in a short period of time, or sending a burst of requests to a port. However, the biggest threat may be the one that has never been detected and classified yet. Part of detecting potential network intrusions is detecting anomalies: connections that are not known to be attacks, but do not resemble connections observed in the past.
In this notebook, K-means is used to detect anomalous network connections based on statistics about each of them.
The data comes from KDD Cup 1999. The dataset is about 708MB and contains about 4.9M connections. For each connection, the dataset contains information like the number of bytes sent, login attempts, TCP errors, and so on. Each connection is one line of CSV-formatted data containing 38 features, plus a label assigning it to one of 23 classes (normal traffic or one of 22 attack types): back, buffer_overflow, ftp_write, guess_passwd, imap, ipsweep, land, loadmodule, multihop, neptune, nmap, normal, perl, phf, pod, portsweep, rootkit, satan, smurf, spy, teardrop, warezclient, warezmaster. For more details about each feature, please follow this link.
Many features take on the value 0 or 1, indicating the presence or absence of a behavior, such as su_attempted in the 15th column. Some features are counts, like num_file_creations in the 17th column. Others are the number of bytes sent and received.
First, we need to import some packages that are used in this notebook.
%matplotlib inline
import os
import sys
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets, cluster
from sklearn.decomposition import PCA
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.mllib.clustering import KMeans, KMeansModel

input_path = "/datasets/k-means/kddcup.data"
raw_data = sc.textFile(input_path, 12)
There are two types of features: numerical features and categorical features. For now, to get familiar with the data and the problem, we only use the numerical features. In our data, we also have a pre-defined group for each connection, which we can later use as "ground truth" for verifying our results.
Note 1: we do not use the labels in the training phase!
Note 2: in general, since clustering is unsupervised, you don't have access to ground truth. For this reason, several metrics to judge the quality of a clustering have been devised. For a short overview of such metrics, follow this link. Note that computing such metrics, that is, assessing the quality of your clustering results, can be as computationally intensive as computing the clustering itself!
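For example, scikit-learn's `silhouette_score` (one of the metrics used at the end of this notebook) can judge a clustering without any ground-truth labels. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# two well-separated blobs (synthetic data)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# the score is close to 1 when clusters are compact and well separated
print(silhouette_score(X, labels))
```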

Where,
label is the pre-defined label of each connection.
vector is a numpy array that contains the values of all features of the connection, except the label and the categorical features at indexes 1, 2, 3. Each vector is a data point.

def parseLine(line):
    cols = line.split(",")
    # label is the last column
    label = cols[-1]
    # vector is every column, except the label
    vector = cols[:-1]
    # delete the columns with index 1->3 (categorical features)
    vector[1:4] = []
    # convert each value from string to float
    vector = np.array(list(map(lambda x: float(x), vector)))
    return (label, vector)

labelsAndData = raw_data.map(lambda line: parseLine(line))
# we only need the data, not the label
data = labelsAndData.map(lambda x: x[1]).cache()
# number of connections
n = data.count()
print('number of connections: ', n)
print(data.take(1))
# get all labels in the dataset
labels_data = labelsAndData.map(lambda x: x[0]).cache()
# count the distribution of labels
list_classes = []
count_classes = []
for key, value in sorted(labels_data.countByValue().items(), key=lambda x: -x[1]):
    list_classes.append(key)
    count_classes.append(value)
    print(key, value)
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
fig = {
    'data': [{'labels': list_classes,
              'values': count_classes,
              'type': 'pie'}],
    'layout': {'title': 'Distribution of classes in the dataset'}
}
iplot(fig)

You can use the following parameters:
Discuss the result from your figure.
clusters = KMeans.train(data, 2, maxIterations=10, runs=10, initializationMode="random")
def PlotData(data, rate, clusters):
    sampledData = data.sample(False, rate)
    clusterCentroids = sc.parallelize(clusters.centers)
    # RDD addition is a union: plot the sampled points together with the centroids
    sampledDataCentroids = sampledData + clusterCentroids
    arraysampledDataCentroids = np.array(sampledDataCentroids.collect())
    print(arraysampledDataCentroids.shape)
    Y_labels = np.array(sampledDataCentroids.map(lambda x: clusters.predict(x)).collect())
    # project the points onto 3 dimensions with PCA for plotting
    pca = PCA(n_components=3)
    score = pca.fit_transform(arraysampledDataCentroids)
    # plot the clusters in color
    fig = plt.figure(1, figsize=(8, 8))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
    plt.cla()
    ax.scatter(score[:, 0], score[:, 1], score[:, 2], c=Y_labels)
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    plt.show()

PlotData(data, 0.3, clusters)

from operator import add

# evaluate the clustering by summing, over all points, the distance to the
# closest centroid (note: the distance is not squared here)
def error(clusters, point):
    closest_center = clusters.centers[clusters.predict(point)]
    return euclidean_distance(point, closest_center)

WSSSE = data.map(lambda point: error(clusters, point)).reduce(add)
print("Sum of distances to closest centroid = " + str(WSSSE))

clusterLabelCount = sorted(labelsAndData.map(lambda x: (clusters.predict(x[1]), x[0])).countByValue().items(),
                           key=lambda xx: -xx[1])
for item in clusterLabelCount:
    print(item)
How many clusters are appropriate for a dataset? For our dataset in particular, we know there are 23 distinct behavior patterns in the data, so it seems that $k$ could be at least 23, or likely even more. In other cases, we don't even have any information about the number of patterns (remember, in general your data is not labelled!). Our task now is to find a good value of $k$. To do that, we have to build and evaluate models with different values of $k$. A clustering can be considered good if each data point is near its closest centroid. One way to evaluate a model is to calculate the mean error of all data points.

# k: the number of clusters
def clusteringScore(data, k):
    clusters = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")
    # calculate the mean error (distance of each point to its closest centroid)
    score = data.map(lambda point: error(clusters, point)).mean()
    print(score)
    return score

kList = range(5, 121, 5)
scores = list(map(lambda xk: (xk, clusteringScore(data, xk)), kList))
Scores = [[5,4254.42275022],[10,2146.95298298],[15,1594.58913877],[20,1586.76156705],[25,1127.27444207],[30,1415.39753099],[35,1083.89091088] \
,[40,1137.41675317],[45,1071.05839445],[50,1041.85747064],[55,1055.67001132],[60,1114.48806286],[65,970.773383168],[70,967.701682823] \
,[75,933.200337438],[80,969.397024588],[85,944.599647518],[90,977.941682483],[95,975.380941916],[100,956.872406351],[105,982.226180595] \
,[110,943.145327694],[115,881.270158057],[120,947.155791838]]
!pip install --upgrade pip
!pip install plotly
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:, 0],
    y=npScores[:, 1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)
data1 = [trace0]
layout = go.Layout(
    title='Figure 1: Mean squared error for each k',
    xaxis=dict(
        title='Number of clusters k'
    ),
    yaxis=dict(
        title='Mean Squared Error'
    )
)
fig = go.Figure(data=data1, layout=layout)
iplot(fig, filename='text-hover-bar')
K-means clustering treats all dimensions/directions of the space equally and therefore tends to produce more or less spherical (rather than elongated) clusters. In this situation, leaving variances uneven is equivalent to putting more weight on variables with smaller variance, so clusters will tend to be separated along variables with greater variance.
In our notebook, since the Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias.
Each feature can be normalized by converting it to a standard score: subtracting the mean of the feature's values from each value, and dividing by the standard deviation:
$\text{normalized}_i = \frac{\text{feature}_i - \mu_i}{\sigma_i}$
where $\text{feature}_i$ is the value of the $i$-th feature, and $\mu_i$ and $\sigma_i$ are its mean and standard deviation over the dataset.
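As a sanity check, the same standardization applied to a small NumPy array (with hypothetical values) leaves each column with mean 0 and standard deviation 1:

```python
import numpy as np

# two features with very different magnitudes (hypothetical values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

means = X.mean(axis=0)    # per-feature mean
stdevs = X.std(axis=0)    # per-feature (population) standard deviation
stdevs[stdevs <= 0] = 1   # avoid division by zero for constant features

Z = (X - means) / stdevs
print(Z.mean(axis=0))     # ~[0. 0.]
print(Z.std(axis=0))      # [1. 1.]
```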

def normalizeData(data):
    # number of connections
    n = data.count()
    print("n = ", n)
    # the sum of each feature
    sums = data.reduce(lambda x, y: x + y)
    print("sum = ", sums)
    # the mean of each feature
    means = sums / n
    print("means = ", means)
    # the sum of squares of each feature
    sumSquares = data.map(lambda x: x ** 2).reduce(lambda x, y: x + y)
    print("sumSquares = ", sumSquares)
    # the standard deviation of each feature: sqrt(E[x^2] - E[x]^2)
    stdevs = np.sqrt(sumSquares / n - means ** 2)
    # avoid division by zero for constant features
    stdevs[stdevs <= 0] = 1
    print("stdevs = ", stdevs)

    def normalize(point):
        return (point - means) / stdevs

    return data.map(lambda x: normalize(x))

normalizedData = normalizeData(data).cache()
print("normalized Data = ", normalizedData.take(5))

kList = range(60,111,10)
scores = list(map(lambda xk: (xk , clusteringScore(normalizedData, xk)), kList))
Scores = [[60,0.474390911372],[70,0.439719518963],[80,0.438664447222],[90,0.423802136193],[100,0.454752147422],[110,0.420780443995]]
npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:, 0],
    y=npScores[:, 1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)
data2 = [trace0]
layout = go.Layout(
    title='Figure 2: Mean squared error for each k, after normalization',
    xaxis=dict(
        title='Number of clusters k'
    ),
    yaxis=dict(
        title='Mean Squared Error'
    )
)
fig = go.Figure(data=data2, layout=layout)
iplot(fig, filename='text-hover-bar')

clusters = KMeans.train(data, 110, maxIterations=10, runs=10, initializationMode="random")
PlotData(data, 0.3, clusters)
clusters = KMeans.train(normalizedData, 110, maxIterations=10, runs=10, initializationMode="random")
PlotData(normalizedData, 0.3, clusters)
In the previous section, we ignored the categorical features of our data. This is not a good idea, since these categorical features can provide useful information for clustering. The problem is that K-means (at least the one we developed, and the one we use from MLlib) only works with data points in a metric space. Informally, this means that operations such as addition, subtraction, and computing the mean of data points are well defined. For a more formal definition of what a metric space is, follow this link.
What we will do next is to transform each categorical feature into one or more numerical features. This approach is very widespread: imagine, for example, that you wanted to use K-means to cluster text data. Then the idea is to transform the text data into $d$-dimensional vectors, and a nice way to do it is to use word2vec. If you're interested, follow this link to a nice blog post on the problem.
There are two approaches:
Approach 1: mapping one categorical feature to one numerical feature. The values of the categorical feature are encoded as unique numbers in the new numerical feature. For example, ['VERY HOT', 'HOT', 'COOL', 'COLD', 'VERY COLD'] will be encoded into [0, 1, 2, 3, 4]. However, by using this method, we implicitly assume an ordering, e.g., that 'VERY HOT' is smaller than 'HOT'. This is not true in general.
Approach 2: mapping one categorical feature to multiple numerical features. Basically, a single variable with $n$ observations and $d$ distinct values is mapped to $d$ binary variables with $n$ observations each; each observation indicates the presence (1) or absence (0) of one of the $d$ values. For example, ['house', 'car', 'tooth', 'car'], which has 3 distinct values, becomes
[
[1,0,0],
[0,1,0],
[0,0,1],
[0,1,0],
]
We call the second approach "one-hot encoding". By using this approach, we keep the same role for all values of categorical features.
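A minimal NumPy sketch of this one-hot encoding, using the same hypothetical category values as above (and mirroring the masking trick used in the parsing code below):

```python
import numpy as np

values = ['house', 'car', 'tooth', 'car']
# distinct values, in order of first appearance
categories = np.array(list(dict.fromkeys(values)))

def one_hot(value):
    # one binary feature per distinct category value
    vec = np.zeros(len(categories), dtype=int)
    vec[categories == value] = 1
    return vec

encoded = np.array([one_hot(v) for v in values])
print(encoded)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]
```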

def testparseLine(line, k):
    # return the value of the k-th column
    cols = line.split(",")
    return cols[k]

# distinct values of the three categorical features
cols1 = np.array(raw_data.map(lambda x: testparseLine(x, 1)).distinct().collect())
cols2 = np.array(raw_data.map(lambda x: testparseLine(x, 2)).distinct().collect())
cols3 = np.array(raw_data.map(lambda x: testparseLine(x, 3)).distinct().collect())
print(cols1)
print(cols2)
print(cols3)
def parseLineWithHotEncoding(line):
    cols = line.split(",")
    # label is the last column
    label = cols[-1]
    vector = cols[0:-1]
    # the binary features encoded from the first categorical feature
    featureOfCol1 = np.array([0] * len(cols1))
    featureOfCol1[cols1 == cols[1]] = 1
    # the binary features encoded from the second categorical feature
    featureOfCol2 = np.array([0] * len(cols2))
    featureOfCol2[cols2 == cols[2]] = 1
    # the binary features encoded from the third categorical feature
    featureOfCol3 = np.array([0] * len(cols3))
    featureOfCol3[cols3 == cols[3]] = 1
    # construct the new vector
    vector = np.hstack((vector[0], featureOfCol1, featureOfCol2, featureOfCol3, vector[4:]))
    # convert each value from string to float
    vector = np.array(list(map(lambda x: float(x), vector)))
    return (label, vector)

labelsAndData = raw_data.map(parseLineWithHotEncoding)
# we only need the data, not the label
data = labelsAndData.values().cache()
print(data.first())
normalizedData = normalizeData(data).cache()
print("normalized Data = ", normalizedData.take(1))

kList = range(80,161,10)
scores = list(map(lambda xk: (xk , clusteringScore(normalizedData, xk)), kList))
Scores = [[80,1.43921545452],[90,1.31055972898],[100,1.21564755766],[110,1.24053003769],[120,1.1027274355],[130,1.21412836334]\
,[140,1.09840907092],[150,1.13197126391],[160,1.03432902661]]
npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:, 0],
    y=npScores[:, 1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)
data3 = [trace0]
layout = go.Layout(
    title='Figure 3: Mean squared error for each k, after normalization and with categorical features',
    xaxis=dict(
        title='Number of clusters k'
    ),
    yaxis=dict(
        title='Mean Squared Error'
    )
)
fig = go.Figure(data=data3, layout=layout)
iplot(fig, filename='text-hover-bar')
When we encounter a new connection (e.g., one that we have never seen before), we simply find the closest cluster for it, and use this information as a proxy to decide whether the data point is anomalous. A simple approach is to measure the new data point's distance to its nearest centroid: if this distance exceeds some threshold, the point is anomalous.
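This thresholding logic can be sketched in plain NumPy (the distances below are synthetic stand-ins for distances to the nearest centroid):

```python
import numpy as np

# hypothetical distances of "normal" training points to their nearest centroid
rng = np.random.RandomState(0)
distances = rng.exponential(scale=1.0, size=10000)

# threshold = the 100th-largest distance seen on training data
threshold = np.sort(distances)[-100]

# a hypothetical new connection, far from every centroid
new_point_distance = 15.0
is_anomalous = new_point_distance > threshold
print(is_anomalous)  # True
```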

clusters = KMeans.train(normalizedData, 160, maxIterations=10, runs=10, initializationMode="random")
# distance of each point to its closest centroid
score = normalizedData.map(lambda point: error(clusters, point))
# use the 100th-largest distance as the anomaly threshold
anoThreshold = score.top(100)[-1]
print(anoThreshold)
print(score.top(1)[0])
# recompute means and stdevs, for use in the function below that maps each raw data line to its error
n = data.count()
#print("n = ", n)
# calculate the sum of each feature
sums = data.reduce(lambda x,y: x + y)
#print("sum = ", sums)
# calculate means
means = sums/n
#print("means = " , means)
# calculate the sum square of each feature
sumSquares = data.map(lambda x: x**2).reduce(lambda x,y:x+y)
#print("sumSquares = ", sumSquares)
# calculate standard deviation of each feature
stdevs = np.sqrt((sumSquares/n - means**2))
stdevs[stdevs <= 0] = 1
#print("stdevs = ", stdevs)
# map each raw data line to its error, for ranking and filtering
def errorFromRawData(clusters, line):
    cols = line.split(",")
    # label is the last column
    label = cols[-1]
    vector = cols[0:-1]
    # the binary features encoded from the three categorical features
    featureOfCol1 = np.array([0] * len(cols1))
    featureOfCol1[cols1 == cols[1]] = 1
    featureOfCol2 = np.array([0] * len(cols2))
    featureOfCol2[cols2 == cols[2]] = 1
    featureOfCol3 = np.array([0] * len(cols3))
    featureOfCol3[cols3 == cols[3]] = 1
    # construct the new vector
    vector = np.hstack((vector[0], featureOfCol1, featureOfCol2, featureOfCol3, vector[4:]))
    # convert each value from string to float
    vector = np.array(list(map(lambda x: float(x), vector)))
    normalizedvector = (vector - means) / stdevs
    closest_center = clusters.centers[clusters.predict(normalizedvector)]
    return (euclidean_distance(normalizedvector, closest_center), line)

raw_data_error = raw_data.map(lambda line: errorFromRawData(clusters, line))
filteredData = raw_data_error.filter(lambda line: line[0] >= anoThreshold)
print(filteredData.count())
# the most anomalous data point
top1 = filteredData.top(1, key=lambda x: x[0])
print(top1)
plotDatas = filteredData.map(lambda x: x[1]).cache()
labelsAndData2 = plotDatas.map(parseLineWithHotEncoding).cache()
data2 = labelsAndData2.values().cache()
normalizedData2 = normalizeData(data2).cache()
def PlotFigureDataWithoutCenter(data, rate, clusters):
    # unlike PlotData, do not add the cluster centroids to the plotted points
    sampledData = data.sample(False, rate)
    arraySampledData = np.array(sampledData.collect())
    print(arraySampledData.shape)
    Y_labels = np.array(sampledData.map(lambda x: clusters.predict(x)).collect())
    # project the points onto 3 dimensions with PCA for plotting
    pca = PCA(n_components=3)
    score = pca.fit_transform(arraySampledData)
    # plot the clusters in color
    fig = plt.figure(1, figsize=(8, 8))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=200)
    plt.cla()
    ax.scatter(score[:, 0], score[:, 1], score[:, 2], c=Y_labels)
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    plt.show()

PlotFigureDataWithoutCenter(normalizedData2, 1, clusters)
PlotData(normalizedData2, 1, clusters)

Here is some additional information about the metrics we suggest using:
# entropy of the label distribution in one cluster, weighted by the cluster size
def calEntropy(listValue):
    ent = 0
    n = len(listValue)
    for x in set(listValue):
        ent -= listValue.count(x) / n * np.log(listValue.count(x) / n)
    return ent * n
# k: the number of clusters
def clusteringScoreEntropy(labels, data, k):
    print(k)
    clusters = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")
    # average, over all points, of the label entropy of each cluster
    closest_centers = data.map(lambda x: clusters.predict(x))
    closest_center_label = closest_centers.zip(labels)
    entScore = closest_center_label.groupByKey().mapValues(list) \
        .map(lambda x: calEntropy(x[1])).sum() / closest_center_label.count()
    print(entScore)
    return entScore

kList = range(80, 161, 10)
entScores = list(map(lambda xk: (xk, clusteringScoreEntropy(labels_data, normalizedData, xk)), kList))
Scores = [[80,0.0220152627366],[90,0.0257946195467],[100,0.0230592355487],[110,0.0249077107456],[120,0.0179280472809],[130,0.0165599799982]\
,[140,0.0183107227647],[150,0.0146660986153],[160,0.0181881936621]]
npScores = np.array(Scores)
trace0 = go.Bar(
    x=npScores[:, 0],
    y=npScores[:, 1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)
data4 = [trace0]
layout = go.Layout(
    title='Figure 4: Entropy for each k',
    xaxis=dict(
        title='Number of clusters k'
    ),
    yaxis=dict(
        title='Entropy'
    )
)
fig = go.Figure(data=data4, layout=layout)
iplot(fig, filename='text-hover-bar')
from sklearn.metrics import silhouette_samples, silhouette_score

# k: the number of clusters
def clusteringScoreSilhouette(data, k):
    print(k)
    clusters = KMeans.train(data, k, maxIterations=10, runs=10, initializationMode="random")
    # silhouette score of the clustering, computed locally with scikit-learn
    closest_centers = data.map(lambda x: clusters.predict(x))
    convertedData = np.array(data.collect())
    convertedLabel = np.array(closest_centers.collect())
    silScore = silhouette_score(convertedData, convertedLabel)
    print(silScore)
    return silScore

kList = range(80, 161, 10)
sampledNormalizedData = normalizedData.sample(False, 0.001)
silScore = list(map(lambda xk: (xk, clusteringScoreSilhouette(sampledNormalizedData, xk)), kList))
silScore = [[80,0.734957043125],[90,0.678320649238],[100,0.679236365658],[110,0.685957339385],[120,0.702924835775] \
,[130,0.699519765653],[140,0.733915148418],[150,0.696458797617],[160,0.729932417629]]
npScores = np.array(silScore)
trace0 = go.Bar(
    x=npScores[:, 0],
    y=npScores[:, 1],
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)
data5 = [trace0]
layout = go.Layout(
    title='Figure 5: Silhouette score for each k',
    xaxis=dict(
        title='Number of clusters k'
    ),
    yaxis=dict(
        title='Silhouette Score'
    )
)
fig = go.Figure(data=data5, layout=layout)
iplot(fig, filename='text-hover-bar')
